Dependency parsing with spaCy

This script takes Unicode plain text and outputs its dependency parse in CoNLL10 format (ten tab-separated columns per token). It was originally written to prepare input files for named/non-named entity extraction with xrenner.

For installation instructions for spaCy, see https://spacy.io/docs#getting-started.


In [1]:
import spacy

Load the English tagger

Note: Loading the tagger is expensive. The documentation says it can take 10-20 seconds and 2-3 GB of RAM.


In [2]:
nlp = spacy.load('en')

Give spaCy some input text, then process it.

Note: spaCy input has to be Unicode.
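
If the text lives in a file rather than in a string literal, one way to make sure spaCy receives Unicode is to decode the file explicitly when reading it. This is only a minimal sketch: the file name `cato.txt`, the helper `read_unicode`, and the choice of UTF-8 are assumptions for illustration, not part of the original script.

```python
import io
import os
import tempfile


def read_unicode(path, encoding='utf-8'):
    """Return the file's contents as a Unicode string (hypothetical helper)."""
    # io.open decodes while reading, in both Python 2 and Python 3.
    with io.open(path, encoding=encoding) as handle:
        return handle.read()


# Demonstration with a throwaway file; 'cato.txt' is a made-up name.
demo_path = os.path.join(tempfile.mkdtemp(), 'cato.txt')
with io.open(demo_path, 'w', encoding='utf-8') as handle:
    handle.write(u'Cato had also a half-sister, Servilia.')

text_from_file = read_unicode(demo_path)
print(text_from_file)
```

The returned string can then be passed to `nlp(...)` in the same way as the literal below.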


In [3]:
text = u'''1. Cato's family got its first lustre and fame from his great-grandfather Cato (a man whose virtue gained him the greatest reputation and influence among the Romans, as has been written in his Life), but the death of both parents left him an orphan, together with his brother Caepio and his sister Porcia. Cato had also a half-sister, Servilia, the daughter of his mother.1 All these children were brought up in the home of Livius Drusus, their uncle on the mother's side, who at that time was a leader in the conduct of public affairs; for he was a most powerful speaker, in general a man of the greatest discretion, and yielded to no Roman in dignity of purpose.
[2] We are told that from his very childhood Cato displayed, in speech, in countenance, and in his childish sports, a nature that was inflexible, imperturbable, and altogether steadfast. He set out to accomplish his purposes with a vigour beyond his years, and while he was harsh and repellent to those who would flatter him, he was still more masterful towards those who tried to frighten him. It was altogether difficult to make him laugh, although once in a while he relaxed his features so far as to smile; and he was not quickly nor easily moved to anger, though once angered he was inexorable.'''

In [4]:
doc = nlp(text)

CoNLL10 output

spaCy's output, in particular its token IDs, takes some massaging to produce a well-formed CoNLL10 document.

The column layout is described at https://corpling.uis.georgetown.edu/xrenner/doc/using.html#input-format.
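
As a rough illustration of the ten-column layout, the sketch below splits one tab-separated row into named fields. The field names follow the CoNLL-X shared-task convention (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL); applying those labels here is an assumption, and `ConllRow` and `parse_row` are hypothetical names.

```python
from collections import namedtuple

# Field names follow the CoNLL-X convention (an assumption for this sketch).
ConllRow = namedtuple(
    'ConllRow',
    'id form lemma cpostag postag feats head deprel phead pdeprel',
)


def parse_row(line):
    """Split one tab-separated CoNLL10 line into a ConllRow."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) != 10:
        raise ValueError('expected 10 columns, got %d' % len(fields))
    return ConllRow(*fields)


row = parse_row("4\tgot\tget\tVBD\tVBD\t_\t0\tROOT\t_\t_")
print(row.form, row.lemma, row.head, row.deprel)
```

All fields stay strings, so `row.head` compares against `'0'` rather than the integer `0`.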


In [5]:
for sent in doc.sents:
    # Create lookup dict: character offset (token.idx) -> 1-based token ID within the sentence.
    ids = {}
    for i, token in enumerate(sent):
        ids[token.idx] = i+1
        
    for token in sent:
        # Clean up token attributes
        token_id = str(ids[token.idx]).strip()
        token_text = str(token).strip()
        lemma = str(token.lemma_).strip()
        pos_tag = str(token.tag_).strip()
        depend = str(token.dep_).strip()
        
        # Set head ID correctly for root of sentence.
        if token.dep_ == 'ROOT':
            head_id = str(0)
        else:
            head_id = str(ids[token.head.idx]).strip()
        
        # CoNLL10 output
        # Comments below are modified from https://corpling.uis.georgetown.edu/xrenner/doc/using.html#input-format
        print(token_id + '\t' +      # token ID w/in sentence
              token_text + '\t' +    # token text
              lemma + '\t' +         # lemmatized token
              pos_tag + '\t' +       # part of speech tag for token (first of two POS columns)
              pos_tag + '\t' +       # same tag repeated in the second POS column
              '_' + '\t' +           # placeholder for morphological information
              head_id + '\t' +       # ID of head token
              depend + '\t' +        # dependency function
              '_' + '\t' + '_')      # two unused columns


1	1	1	CD	CD	_	0	ROOT	_	_
2	.	.	.	.	_	1	punct	_	_
1	Cato	cato	NNP	NNP	_	3	poss	_	_
2	's	's	POS	POS	_	1	case	_	_
3	family	family	NN	NN	_	4	nsubj	_	_
4	got	get	VBD	VBD	_	0	ROOT	_	_
5	its	its	PRP$	PRP$	_	7	poss	_	_
6	first	first	JJ	JJ	_	7	amod	_	_
7	lustre	lustre	NN	NN	_	4	dobj	_	_
8	and	and	CC	CC	_	7	cc	_	_
9	fame	fame	NN	NN	_	7	conj	_	_
10	from	from	IN	IN	_	4	prep	_	_
11	his	his	PRP$	PRP$	_	15	poss	_	_
12	great	great	JJ	JJ	_	14	amod	_	_
13	-	-	HYPH	HYPH	_	14	punct	_	_
14	grandfather	grandfather	NN	NN	_	15	compound	_	_
15	Cato	cato	NNP	NNP	_	10	pobj	_	_
16	(	(	-LRB-	-LRB-	_	15	punct	_	_
17	a	a	DT	DT	_	18	det	_	_
18	man	man	NN	NN	_	15	appos	_	_
19	whose	whose	WP$	WP$	_	20	poss	_	_
20	virtue	virtue	NN	NN	_	21	nsubj	_	_
21	gained	gain	VBD	VBD	_	18	relcl	_	_
22	him	him	PRP	PRP	_	21	dobj	_	_
23	the	the	DT	DT	_	25	det	_	_
24	greatest	great	JJS	JJS	_	25	amod	_	_
25	reputation	reputation	NN	NN	_	21	dobj	_	_
26	and	and	CC	CC	_	25	cc	_	_
27	influence	influence	NN	NN	_	25	conj	_	_
28	among	among	IN	IN	_	25	prep	_	_
29	the	the	DT	DT	_	30	det	_	_
30	Romans	roman	NNPS	NNPS	_	28	pobj	_	_
31	,	,	,	,	_	21	punct	_	_
32	as	as	IN	IN	_	35	mark	_	_
33	has	have	VBZ	VBZ	_	35	aux	_	_
34	been	be	VBN	VBN	_	35	auxpass	_	_
35	written	write	VBN	VBN	_	21	advcl	_	_
36	in	in	IN	IN	_	35	prep	_	_
37	his	his	PRP$	PRP$	_	38	poss	_	_
38	Life	life	NNP	NNP	_	36	pobj	_	_
39	)	)	-RRB-	-RRB-	_	15	punct	_	_
40	,	,	,	,	_	4	punct	_	_
41	but	but	CC	CC	_	4	cc	_	_
42	the	the	DT	DT	_	43	det	_	_
43	death	death	NN	NN	_	47	nsubj	_	_
44	of	of	IN	IN	_	43	prep	_	_
45	both	both	DT	DT	_	46	det	_	_
46	parents	parent	NNS	NNS	_	44	pobj	_	_
47	left	leave	VBD	VBD	_	4	conj	_	_
48	him	him	PRP	PRP	_	47	dobj	_	_
49	an	an	DT	DT	_	50	det	_	_
50	orphan	orphan	NN	NN	_	47	dobj	_	_
51	,	,	,	,	_	47	punct	_	_
52	together	together	RB	RB	_	53	advmod	_	_
53	with	with	IN	IN	_	47	prep	_	_
54	his	his	PRP$	PRP$	_	55	poss	_	_
55	brother	brother	NN	NN	_	53	pobj	_	_
56	Caepio	caepio	NNP	NNP	_	55	appos	_	_
57	and	and	CC	CC	_	55	cc	_	_
58	his	his	PRP$	PRP$	_	59	poss	_	_
59	sister	sister	NN	NN	_	55	conj	_	_
60	Porcia	porcia	NNP	NNP	_	59	appos	_	_
61	.	.	.	.	_	47	punct	_	_
1	Cato	cato	NNP	NNP	_	2	nsubj	_	_
2	had	have	VBD	VBD	_	53	ccomp	_	_
3	also	also	RB	RB	_	2	advmod	_	_
4	a	a	DT	DT	_	7	det	_	_
5	half	half	NN	NN	_	7	amod	_	_
6	-	-	HYPH	HYPH	_	7	punct	_	_
7	sister	sister	NN	NN	_	2	dobj	_	_
8	,	,	,	,	_	7	punct	_	_
9	Servilia	servilia	NNP	NNP	_	7	appos	_	_
10	,	,	,	,	_	7	punct	_	_
11	the	the	DT	DT	_	12	det	_	_
12	daughter	daughter	NN	NN	_	7	appos	_	_
13	of	of	IN	IN	_	12	prep	_	_
14	his	his	PRP$	PRP$	_	15	poss	_	_
15	mother.1	mother.1	NN	NN	_	13	pobj	_	_
16	All	all	PDT	PDT	_	18	predet	_	_
17	these	these	DT	DT	_	18	det	_	_
18	children	child	NNS	NNS	_	20	nsubjpass	_	_
19	were	be	VBD	VBD	_	20	auxpass	_	_
20	brought	bring	VBN	VBN	_	2	ccomp	_	_
21	up	up	RP	RP	_	20	prt	_	_
22	in	in	IN	IN	_	20	prep	_	_
23	the	the	DT	DT	_	24	det	_	_
24	home	home	NN	NN	_	22	pobj	_	_
25	of	of	IN	IN	_	24	prep	_	_
26	Livius	livius	NNP	NNP	_	27	compound	_	_
27	Drusus	drusus	NNP	NNP	_	25	pobj	_	_
28	,	,	,	,	_	27	punct	_	_
29	their	their	PRP$	PRP$	_	30	poss	_	_
30	uncle	uncle	NN	NN	_	27	appos	_	_
31	on	on	IN	IN	_	30	prep	_	_
32	the	the	DT	DT	_	33	det	_	_
33	mother	mother	NN	NN	_	35	poss	_	_
34	's	's	POS	POS	_	33	case	_	_
35	side	side	NN	NN	_	31	pobj	_	_
36	,	,	,	,	_	30	punct	_	_
37	who	who	WP	WP	_	41	nsubj	_	_
38	at	at	IN	IN	_	41	prep	_	_
39	that	that	DT	DT	_	40	det	_	_
40	time	time	NN	NN	_	38	pobj	_	_
41	was	be	VBD	VBD	_	27	relcl	_	_
42	a	a	DT	DT	_	43	det	_	_
43	leader	leader	NN	NN	_	41	attr	_	_
44	in	in	IN	IN	_	43	prep	_	_
45	the	the	DT	DT	_	46	det	_	_
46	conduct	conduct	NN	NN	_	44	pobj	_	_
47	of	of	IN	IN	_	46	prep	_	_
48	public	public	JJ	JJ	_	49	amod	_	_
49	affairs	affair	NNS	NNS	_	47	pobj	_	_
50	;	;	:	:	_	53	punct	_	_
51	for	for	IN	IN	_	53	prep	_	_
52	he	he	PRP	PRP	_	53	nsubj	_	_
53	was	be	VBD	VBD	_	0	ROOT	_	_
54	a	a	DT	DT	_	57	det	_	_
55	most	most	RBS	RBS	_	56	advmod	_	_
56	powerful	powerful	JJ	JJ	_	57	amod	_	_
57	speaker	speaker	NN	NN	_	53	attr	_	_
58	,	,	,	,	_	57	punct	_	_
59	in	in	IN	IN	_	53	prep	_	_
60	general	general	JJ	JJ	_	59	amod	_	_
61	a	a	DT	DT	_	62	det	_	_
62	man	man	NN	NN	_	53	attr	_	_
63	of	of	IN	IN	_	62	prep	_	_
64	the	the	DT	DT	_	66	det	_	_
65	greatest	great	JJS	JJS	_	66	amod	_	_
66	discretion	discretion	NN	NN	_	63	pobj	_	_
67	,	,	,	,	_	62	punct	_	_
68	and	and	CC	CC	_	53	cc	_	_
69	yielded	yield	VBD	VBD	_	53	conj	_	_
70	to	to	IN	IN	_	69	prep	_	_
71	no	no	DT	DT	_	72	det	_	_
72	Roman	roman	NN	NN	_	70	pobj	_	_
73	in	in	IN	IN	_	72	prep	_	_
74	dignity	dignity	NN	NN	_	73	pobj	_	_
75	of	of	IN	IN	_	74	prep	_	_
76	purpose	purpose	NN	NN	_	75	pobj	_	_
77	.	.	.	.	_	53	punct	_	_
1			SP	SP	_	2		_	_
2	[	[	-LRB-	-LRB-	_	3	punct	_	_
3	2	2	CD	CD	_	7	npadvmod	_	_
4	]	]	-RRB-	-RRB-	_	3	punct	_	_
5	We	we	PRP	PRP	_	7	nsubjpass	_	_
6	are	be	VBP	VBP	_	7	auxpass	_	_
7	told	tell	VBN	VBN	_	0	ROOT	_	_
8	that	that	IN	IN	_	14	mark	_	_
9	from	from	IN	IN	_	14	prep	_	_
10	his	his	PRP$	PRP$	_	12	poss	_	_
11	very	very	RB	RB	_	12	amod	_	_
12	childhood	childhood	NN	NN	_	9	pobj	_	_
13	Cato	cato	NNP	NNP	_	14	nsubj	_	_
14	displayed	display	VBD	VBD	_	7	ccomp	_	_
15	,	,	,	,	_	14	punct	_	_
16	in	in	IN	IN	_	14	prep	_	_
17	speech	speech	NN	NN	_	16	pobj	_	_
18	,	,	,	,	_	14	punct	_	_
19	in	in	IN	IN	_	14	prep	_	_
20	countenance	countenance	NN	NN	_	19	pobj	_	_
21	,	,	,	,	_	14	punct	_	_
22	and	and	CC	CC	_	14	cc	_	_
23	in	in	IN	IN	_	14	conj	_	_
24	his	his	PRP$	PRP$	_	26	poss	_	_
25	childish	childish	JJ	JJ	_	26	amod	_	_
26	sports	sport	NNS	NNS	_	23	pobj	_	_
27	,	,	,	,	_	23	advmod	_	_
28	a	a	DT	DT	_	29	det	_	_
29	nature	nature	NN	NN	_	23	pobj	_	_
30	that	that	WDT	WDT	_	31	nsubj	_	_
31	was	be	VBD	VBD	_	23	relcl	_	_
32	inflexible	inflexible	JJ	JJ	_	31	acomp	_	_
33	,	,	,	,	_	32	punct	_	_
34	imperturbable	imperturbable	JJ	JJ	_	32	conj	_	_
35	,	,	,	,	_	34	punct	_	_
36	and	and	CC	CC	_	34	cc	_	_
37	altogether	altogether	RB	RB	_	38	advmod	_	_
38	steadfast	steadfast	JJ	JJ	_	34	conj	_	_
39	.	.	.	.	_	7	punct	_	_
1	He	he	PRP	PRP	_	2	nsubj	_	_
2	set	set	VBD	VBD	_	0	ROOT	_	_
3	out	out	RP	RP	_	2	prt	_	_
4	to	to	TO	TO	_	5	aux	_	_
5	accomplish	accomplish	VB	VB	_	2	advcl	_	_
6	his	his	PRP$	PRP$	_	7	poss	_	_
7	purposes	purpose	NNS	NNS	_	5	dobj	_	_
8	with	with	IN	IN	_	5	prep	_	_
9	a	a	DT	DT	_	10	det	_	_
10	vigour	vigour	NN	NN	_	8	pobj	_	_
11	beyond	beyond	IN	IN	_	10	prep	_	_
12	his	his	PRP$	PRP$	_	13	poss	_	_
13	years	year	NNS	NNS	_	11	pobj	_	_
14	,	,	,	,	_	2	punct	_	_
15	and	and	CC	CC	_	2	cc	_	_
16	while	while	IN	IN	_	18	mark	_	_
17	he	he	PRP	PRP	_	18	nsubj	_	_
18	was	be	VBD	VBD	_	30	advcl	_	_
19	harsh	harsh	JJ	JJ	_	18	acomp	_	_
20	and	and	CC	CC	_	19	cc	_	_
21	repellent	repellent	NN	NN	_	19	conj	_	_
22	to	to	IN	IN	_	19	prep	_	_
23	those	those	DT	DT	_	22	pobj	_	_
24	who	who	WP	WP	_	26	nsubj	_	_
25	would	would	MD	MD	_	26	aux	_	_
26	flatter	flatter	VB	VB	_	23	relcl	_	_
27	him	him	PRP	PRP	_	26	dobj	_	_
28	,	,	,	,	_	30	punct	_	_
29	he	he	PRP	PRP	_	30	nsubj	_	_
30	was	be	VBD	VBD	_	2	conj	_	_
31	still	still	RB	RB	_	30	advmod	_	_
32	more	more	RBR	RBR	_	33	advmod	_	_
33	masterful	masterful	JJ	JJ	_	30	acomp	_	_
34	towards	towards	IN	IN	_	33	prep	_	_
35	those	those	DT	DT	_	34	pobj	_	_
36	who	who	WP	WP	_	37	nsubj	_	_
37	tried	try	VBD	VBD	_	35	relcl	_	_
38	to	to	TO	TO	_	39	aux	_	_
39	frighten	frighten	VB	VB	_	37	xcomp	_	_
40	him	him	PRP	PRP	_	39	dobj	_	_
41	.	.	.	.	_	30	punct	_	_
1	It	it	PRP	PRP	_	2	nsubj	_	_
2	was	be	VBD	VBD	_	0	ROOT	_	_
3	altogether	altogether	RB	RB	_	2	advmod	_	_
4	difficult	difficult	JJ	JJ	_	2	acomp	_	_
5	to	to	TO	TO	_	6	aux	_	_
6	make	make	VB	VB	_	2	xcomp	_	_
7	him	him	PRP	PRP	_	8	nsubj	_	_
8	laugh	laugh	VB	VB	_	6	ccomp	_	_
9	,	,	,	,	_	2	punct	_	_
10	although	although	IN	IN	_	16	mark	_	_
11	once	once	RB	RB	_	12	advmod	_	_
12	in	in	IN	IN	_	16	prep	_	_
13	a	a	DT	DT	_	14	det	_	_
14	while	while	NN	NN	_	12	pobj	_	_
15	he	he	PRP	PRP	_	16	nsubj	_	_
16	relaxed	relax	VBD	VBD	_	2	advcl	_	_
17	his	his	PRP$	PRP$	_	18	poss	_	_
18	features	feature	NNS	NNS	_	16	dobj	_	_
19	so	so	RB	RB	_	20	advmod	_	_
20	far	far	RB	RB	_	23	advmod	_	_
21	as	as	IN	IN	_	23	mark	_	_
22	to	to	IN	IN	_	23	aux	_	_
23	smile	smile	VB	VB	_	16	advcl	_	_
24	;	;	:	:	_	16	punct	_	_
25	and	and	CC	CC	_	16	cc	_	_
26	he	he	PRP	PRP	_	27	nsubj	_	_
27	was	be	VBD	VBD	_	32	auxpass	_	_
28	not	not	RB	RB	_	27	neg	_	_
29	quickly	quickly	RB	RB	_	32	advmod	_	_
30	nor	nor	CC	CC	_	29	cc	_	_
31	easily	easily	RB	RB	_	29	conj	_	_
32	moved	move	VBD	VBD	_	16	conj	_	_
33	to	to	IN	IN	_	32	prep	_	_
34	anger	anger	NN	NN	_	33	pobj	_	_
35	,	,	,	,	_	32	punct	_	_
36	though	though	IN	IN	_	38	mark	_	_
37	once	once	RB	RB	_	38	advmod	_	_
38	angered	anger	VBD	VBD	_	32	advcl	_	_
39	he	he	PRP	PRP	_	40	nsubj	_	_
40	was	be	VBD	VBD	_	38	ccomp	_	_
41	inexorable	inexorable	JJ	JJ	_	40	acomp	_	_
42	.	.	.	.	_	2	punct	_	_
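
The output above runs the sentences together, so sentence boundaries can only be inferred from the token ID resetting to 1. CoNLL-style files normally separate sentences with a blank line. The sketch below restates the loop from cell 5 as a function that returns the rows as a string, with a blank line closing each sentence. It assumes only that tokens expose `idx`, `text`, `lemma_`, `tag_`, `dep_` and `head`, as spaCy tokens do; `to_conll10` and the stand-in `FakeToken` class are hypothetical names, used so the example runs without loading a model.

```python
def to_conll10(sentences):
    """Return CoNLL10 text for an iterable of sentences (sequences of tokens)."""
    lines = []
    for sent in sentences:
        # Map each token's character offset to its 1-based position in the sentence.
        ids = {token.idx: i for i, token in enumerate(sent, start=1)}
        for token in sent:
            # The root has no head inside the sentence, so its head ID is 0.
            head_id = 0 if token.dep_ == 'ROOT' else ids[token.head.idx]
            lines.append('\t'.join([
                str(ids[token.idx]),   # token ID within sentence
                token.text.strip(),    # token text
                token.lemma_.strip(),  # lemmatized token
                token.tag_,            # POS tag (first POS column)
                token.tag_,            # POS tag (second POS column)
                '_',                   # placeholder for morphological information
                str(head_id),          # ID of head token
                token.dep_,            # dependency function
                '_', '_',              # two unused columns
            ]))
        lines.append('')  # blank line closes the sentence
    return '\n'.join(lines)


class FakeToken(object):
    """Stand-in for a spaCy token, for demonstration only."""

    def __init__(self, idx, text, lemma, tag, dep):
        self.idx, self.text, self.lemma_ = idx, text, lemma
        self.tag_, self.dep_ = tag, dep
        self.head = self  # replaced below for non-root tokens


he = FakeToken(0, 'He', 'he', 'PRP', 'nsubj')
ran = FakeToken(3, 'ran', 'run', 'VBD', 'ROOT')
he.head = ran
print(to_conll10([[he, ran]]))
```

With a real parse, this would presumably be called as `to_conll10(doc.sents)`.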
